Contextual Dependencies in Unsupervised Word Segmentation

نویسندگان

  • Sharon Goldwater
  • Thomas L. Griffiths
  • Mark Johnson
چکیده

Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation. We also show that previous probabilistic models rely crucially on suboptimal search procedures.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Unsupervised Word Segmentation for Sesotho Using Adaptor Grammars

This paper describes a variety of nonparametric Bayesian models of word segmentation based on Adaptor Grammars that model different aspects of the input and incorporate different kinds of prior knowledge, and applies them to the Bantu language Sesotho. While we find overall word segmentation accuracies lower than these models achieve on English, we also find some interesting differences in whic...

متن کامل

Unsupervised phonemic Chinese word segmentation using Adaptor Grammars

Adaptor grammars are a framework for expressing and performing inference over a variety of non-parametric linguistic models. These models currently provide state-of-the-art performance on unsupervised word segmentation from phonemic representations of child-directed unsegmented English utterances. This paper investigates the applicability of these models to unsupervised word segmentation of Man...

متن کامل

Recurrent Neural Word Segmentation with Tag Inference

In this paper, we present a Long Short-TermMemory (LSTM) based model for the task of Chinese Weibo word segmentation. The model adopts a LSTM layer to capture long-range dependencies in sentence and learn the underlying patterns. In order to infer the optimal tag path, we introduce a transition score matrix for jumping between tags of successive characters. Integrated with some unsupervised fea...

متن کامل

A joint model of word segmentation and phonological variation for English word-final /t/-deletion

Word-final /t/-deletion refers to a common phenomenon in spoken English where words such as /wEst/ “west” are pronounced as [wEs] “wes” in certain contexts. Phonological variation like this is common in naturally occurring speech. Current computational models of unsupervised word segmentation usually assume idealized input that is devoid of these kinds of variation. We extend a non-parametric m...

متن کامل

An Unsupervised Iterative Method for Chinese New Lexicon Extractio

An unsupervised iterative approach for extracting a new lexicon (or unknown words) from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-mergingfiltering-and-disambiguation approach, the proposed method iteratively integrates the contextual constraints (among word candidates) and a joint character association metric to progressively improve the segm...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006